
# **Recent Progress in Development of Language Model for Slovak Large Vocabulary Continuous Speech Recognition**

Jozef Juhár, Ján Staš and Daniel Hládek

*Technical University of Košice, Slovakia*

# **1. Introduction**

Speech technologies have the potential to simplify human-machine interaction as well as communication between people, and the use of speech technology applications is growing continuously. Every speech recognition system, which lies at the heart of any speech application, is not only algorithmically complex but also strongly language dependent. Therefore, one of the challenging tasks in the development of a *Slovak large vocabulary continuous speech recognition* (LVCSR) system is the creation of an efficient language model (LM).

Developing a language model for Slovak, which belongs to the group of highly inflective languages, is more laborious than creating an English language model. The first reason is that Slovak is characterized by a relatively free word order in sentences. This leads to the problem of *sparseness of the text data* used for training language models (LMs). The second reason is the *inflection in the language* itself: its rich morphology leads to a vocabulary several times larger than in English. Therefore, the amount of text data needed to cover the Slovak language statistically is substantially higher.

Contemporary modeling of the Slovak language builds on experience with modeling related Slavic languages, such as Czech, Polish, Serbo-Croatian or Russian (Nouza et al., 2010). From a statistical point of view, Slovak is very similar to Czech, especially in how words are formed into sentences and how sentence semantics is determined. In contrast, from a linguistic point of view, mainly in the phenomena of inflection and assimilation of voicing, Slovak is closer to Polish. Therefore, it is appropriate to constrain statistical language modeling with linguistic knowledge as well.

This chapter describes the results of Slovak language model development for a domain-specific LVCSR task in the judiciary and for broadcast news transcription. During this process, we coped with several problems in text preprocessing, in the selection of basic statistical methods used for modeling other similar languages, and in adaptation to the area of application. Another important part of Slovak language modeling was the optimization of the resulting model, which introduced phonetic and linguistic relations between words. These optimization steps improved the quality of our LM as well as the recognition accuracy of the LVCSR system itself.

This chapter is organized as follows. Section 2 introduces the process of gathering and preprocessing the text corpora used in training LMs. Section 3 describes the creation of a vocabulary of the Slovak language. Section 4 presents the selection of an appropriate smoothing technique, a method for adaptation to the given domain and an optimal pruning algorithm. Some proposed optimization approaches in modeling of the Slovak language are summarized in Section 5. Section 6 presents the setup of the Slovak LVCSR system used in a real task of domain-oriented speech recognition. Section 7 summarizes the experimental results. Section 8 closes this chapter with the discussion.

# **2. Text data and preprocessing**

Small languages of Eastern Europe, such as Slovak, can be considered under-resourced, because they usually suffer from a lack of audio databases and linguistic resources. The main prerequisite for creating an effective LM for any language is therefore to collect and consistently process a large amount of text data for LM training. For this purpose, we have proposed an automatic system called *webAgent* (Hládek & Staš, 2010a), which retrieves text data from various web pages written in Slovak. Moreover, the text gathering system is able to detect the character encoding of a given web page, to collect links to other web pages and to retrieve text data from DOC (MS Word), RTF or PDF documents as well.

Before training LMs, it was necessary to transform the text data into pronunciation form. The text preprocessing steps include: (a) *word tokenization*, (b) *text normalization*, (c) *sentence segmentation* and (d) *filtering of grammatically incorrect sentences* (Hládek & Staš, 2010b).

The most important preprocessing operation is text normalization, for which the following rules have been proposed:


When preprocessing the domain-specific text data from the field of judicature, we also had to resolve the problem of transcribing a large number of specific abbreviations and numerals (Staš et al., 2010b). Normalized documents are then stored in a relational database based on PostgreSQL along with their titles, the URIs of the web pages, and the names of the sources where they were published. The database is closely associated with the text gathering system. When text data are inserted into the database, a duplicity check is performed. At present, we work with a text corpus of about 1.9 billion tokens in more than 100 million sentences. The text corpus is divided into several domain-related sub-corpora (see Table 1).

Table 1. Statistics of text corpora

For filtering grammatically incorrect words we used our spell-check lexicon, created by merging available open source dictionaries such as *aspell*, *hunspell* and *ispell* (sk-spell, 2010) with lists of proper nouns, geographical items and various named entities available on the Internet. The lexicon for spell-checking contains about 1.25 million unique words (Staš et al., 2011a).
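The lexicon-based filtering step can be illustrated with a minimal sketch. It assumes a sentence is kept only when the share of its tokens missing from the lexicon does not exceed a threshold; the function names, the `max_oov_ratio` parameter and the toy lexicon are hypothetical and are not taken from the system described here.

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and return its word tokens (Unicode letters only)."""
    return re.findall(r"[^\W\d_]+", sentence.lower())

def filter_sentences(sentences, lexicon, max_oov_ratio=0.0):
    """Keep sentences whose share of out-of-lexicon tokens is at most max_oov_ratio."""
    kept = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if not tokens:
            continue  # nothing to check, drop empty lines
        oov = sum(1 for token in tokens if token not in lexicon)
        if oov / len(tokens) <= max_oov_ratio:
            kept.append(sentence)
    return kept

# Toy lexicon and input (hypothetical); the second sentence contains a misspelling.
lexicon = {"súd", "rozhodol", "o", "odvolaní"}
sentences = ["Súd rozhodol o odvolaní", "Súd rozhodol o odvolanii"]
clean = filter_sentences(sentences, lexicon)
```

With the default threshold of zero, a single unknown token is enough to discard a sentence; a looser threshold would trade corpus size against cleanliness.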

# **3. Vocabulary**

The vocabulary used in language modeling was selected from the collected text corpora using standard methods: the *most frequent* words in the training corpora were taken, and a *maximum likelihood* approach (Venkataraman & Wang, 2003) was used to select domain-specific words from the field of judicature. The vocabulary was then extended with a number of names and surnames, geographical items, names of various institutions and some other named entities in the Slovak Republic, as can be seen in Table 2.
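Frequency-based vocabulary selection can be sketched as follows. This is only an illustration of the "most frequent words" criterion; the maximum likelihood selection of Venkataraman & Wang is not implemented, and the function names and the toy corpus are assumptions.

```python
from collections import Counter

def select_vocabulary(sentences, size):
    """Return the `size` most frequent words of a whitespace-tokenized corpus."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Sort by descending frequency, then alphabetically so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]

def oov_rate(vocabulary, sentences):
    """Share of running words in `sentences` that are not covered by the vocabulary."""
    vocab = set(vocabulary)
    tokens = [word for sentence in sentences for word in sentence.split()]
    return sum(word not in vocab for word in tokens) / len(tokens)

corpus = ["a b a c", "a b d"]            # toy training corpus (hypothetical)
vocabulary = select_vocabulary(corpus, 2)
rate = oov_rate(vocabulary, ["a b e"])   # one of three test tokens is unknown
```

The out-of-vocabulary rate on held-out text is a simple way to judge how well a candidate vocabulary size covers the target domain.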


Table 2. Vocabulary

We have also proposed an automatic tool for generating inflected word forms of names and surnames. These forms were used in modeling of the Slovak language (a) in the on-line dictation LVCSR system as an *independent model of names and surnames*, and later (b) in *modeling of names and surnames using word classes* conditioned by their grammatical category.

We have found that, with the currently available text data, the best results in modeling the Slovak language are achieved with a vocabulary of about 100 to 150 thousand words for the domain-specific task and about 300 to 350 thousand words for the general-domain task of speech recognition. All words in the vocabulary were manually checked and corrected by linguistic experts.

# **4. Statistical modeling of the Slovak language**

In the following sections, selected methods for smoothing, adaptation, combination and pruning are summarized. The most suitable algorithms were later used in training the reference Slovak language model, as described in Section 6.1.

## **4.1 Language model**

In general, a language model determines the probability of a sequence of words as well as of a word itself, which helps the decoder find the most probable sequence of words corresponding to the acoustic information pronounced by the user. Contemporary language modeling is based on *n*-grams, which consider the statistical dependency between *n* individual words.

Formally, the main aim of the *n*-gram model is to determine the a priori probability $P(W)$ of a sequence of words $W = \{w_1 w_2 \dots w_n\}$ and to provide the quickest and most exact estimate of this sequence of words in the decoding process of an LVCSR system. This probability can be defined as follows

$$P(W) = P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \dots w_{i-1}) \tag{1}$$

where $P(w_i \mid w_1 w_2 \dots w_{i-1})$ is the conditional probability of word $w_i$ given its history $\{w_1 w_2 \dots w_{i-1}\}$. This decomposition allows the LVCSR system to recognize a sequence of words while it is being pronounced and to compute the probability $P(W)$ gradually for the search strategy in the decoding process.

The main advantage of *n*-gram models in LVCSR is that their probabilities are relatively easy to estimate: the *maximum likelihood* approach computes the relative frequency of words or word sequences in the training data set (Jurafsky & Martin, 2009; Manning & Schütze, 1999).
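As a minimal sketch of maximum likelihood estimation, the code below counts trigrams and their two-word histories and divides one by the other. It assumes whitespace-tokenized sentences and the conventional `<s>` and `</s>` boundary symbols; the function names and the toy corpus are hypothetical.

```python
from collections import Counter

def train_mle(sentences, n=3):
    """Count n-grams and their (n-1)-word histories in a tokenized corpus."""
    ngrams, histories = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(n - 1, len(words)):
            history = tuple(words[i - n + 1:i])
            ngrams[history + (words[i],)] += 1
            histories[history] += 1
    return ngrams, histories

def mle_prob(word, history, ngrams, histories):
    """Relative frequency estimate of P(word | history); zero for unseen events."""
    history = tuple(history)
    if histories[history] == 0:
        return 0.0
    return ngrams[history + (word,)] / histories[history]

corpus = ["a b c", "a b d", "a b c"]             # toy corpus (hypothetical)
ngrams, histories = train_mle(corpus, n=3)
p_seen = mle_prob("c", ["a", "b"], ngrams, histories)    # 2 of 3 continuations
p_unseen = mle_prob("e", ["a", "b"], ngrams, histories)  # never observed
```

The zero returned for the unseen continuation is exactly the data sparseness problem that smoothing has to address.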

## **4.2 Smoothing**

As mentioned earlier, to deal with the problem of *data sparseness*, re-estimation methods such as discounting, interpolation or backing-off, collectively called smoothing, are used in statistical language modeling (Jurafsky & Martin, 2009).

Because a speaker can also pronounce a sentence that does not occur in the training data set, the probability of such events can be estimated as zero. The problem of zero probabilities, which leads to recognition errors, is resolved by smoothing the language model. Smoothing redistributes part of the probability mass of observed *n*-grams among *n*-grams that were not observed in the training data set. Several smoothing techniques exist, such as *additive Add-One* or *Add-δ smoothing*, *Ristad's natural law*, *Good-Turing estimation*, the *Katz back-off model*, *absolute* and *linear discounting*, the *Witten-Bell model* (Manning & Schütze, 1999), or *Kneser-Ney smoothing* and its modifications, which use *n*-gram counts or counts of these counts to compute the discounting constants (Chen & Goodman, 1996).
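To make the idea of redistributing probability mass concrete, here is a minimal sketch of Add-δ smoothing for bigrams, the simplest of the listed techniques. It is not the method used in the experiments of this chapter; the toy corpus, the value of `delta` and the function name are assumptions chosen for illustration.

```python
from collections import Counter

corpus = ["a b", "a c", "a b"]  # toy corpus (hypothetical)
bigrams, histories = Counter(), Counter()
vocab = set()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    vocab.update(words[1:])  # symbols that can be predicted: words and </s>
    for prev, cur in zip(words, words[1:]):
        bigrams[(prev, cur)] += 1
        histories[prev] += 1

def add_delta_prob(cur, prev, delta=0.5):
    """Add-delta estimate of P(cur | prev): every count is increased by delta."""
    return (bigrams[(prev, cur)] + delta) / (histories[prev] + delta * len(vocab))

p_seen = add_delta_prob("b", "a")    # observed twice after "a"
p_unseen = add_delta_prob("a", "a")  # never observed, but no longer zero
total = sum(add_delta_prob(word, "a") for word in vocab)
```

The estimates for a given history still sum to one, but part of the mass has moved from observed to unobserved bigrams.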

We observed that, among all smoothing techniques, the best results in modeling the Slovak language are produced by the following algorithms:


## **4.3 Adaptation and combination**

*Language model adaptation* (LMA) plays an important role in enhancing the performance of an LVCSR system for domain-specific speech recognition. The basic idea of LMA is to use a small amount of domain-specific text data to adjust LMs so as to reduce the impact of linguistic differences between the training and testing text data, and to set the parameters of independent topic-dependent LMs so that they correspond as closely as possible to the real conditions of the LVCSR application. LMA takes into account not only the statistical dependencies between words in the given language, but also the frequency of word occurrences, the structure of the text data and additional information that usually comes from the fields of linguistics and phonology (Staš et al., 2010a).

LMA is usually performed by combining several different topic-dependent LMs, where an adaptation text (held-out data set) is used to adjust the parameters of these LMs. In recent years, many techniques have been designed for adapting and combining LMs, including *maximum a posteriori* (MAP) approaches such as *count merging* and *linear*, *log-linear* or *generalized linear interpolation* (Gao et al., 2006; Hsu, 2009), and discriminative methods such as LMA based on *minimum discriminative information*, *boosting*, the *perceptron algorithm* or the *minimum sample risk method* (Gao et al., 2006), which come from the *maximum entropy* approach.

We have observed that algorithms producing significant results for languages with strong statistical dependencies, such as English, do not bring a notable improvement in modeling the Slovak language. Based on a detailed analysis of the experimental results for methods of adaptation and combination of LMs published in (Staš et al., 2010a), we also concluded that *linear interpolation* or its generalized alternative is sufficient for the Slovak language, and that the interpolation weights should be adjusted using the *expectation-maximization* (EM) algorithm by minimizing perplexity on a held-out data set.
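The EM estimation of interpolation weights can be sketched as follows. The sketch assumes that each component LM has already assigned a non-zero probability to every token of the held-out text; the function names and the toy probabilities are hypothetical, and a fixed number of iterations stands in for a proper convergence test.

```python
import math

def em_interpolation_weights(component_probs, iterations=50):
    """Estimate linear interpolation weights on held-out data with EM.

    component_probs[t][k] is the probability that component model k
    assigns to held-out token t (assumed non-zero, i.e. smoothed models).
    """
    k = len(component_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for probs in component_probs:
            mixture = sum(w * p for w, p in zip(weights, probs))
            for j in range(k):
                # E-step: posterior that token t was generated by model j
                expected[j] += weights[j] * probs[j] / mixture
        # M-step: new weights are the average posteriors
        weights = [e / len(component_probs) for e in expected]
    return weights

def perplexity(component_probs, weights):
    """Perplexity of the interpolated model on the held-out tokens."""
    log_sum = sum(math.log(sum(w * p for w, p in zip(weights, probs)))
                  for probs in component_probs)
    return math.exp(-log_sum / len(component_probs))

# Probabilities of four held-out tokens under two component LMs (toy values).
held_out = [(0.5, 0.1), (0.4, 0.1), (0.1, 0.3), (0.6, 0.2)]
weights = em_interpolation_weights(held_out)
ppl_uniform = perplexity(held_out, [0.5, 0.5])
ppl_em = perplexity(held_out, weights)
```

Each EM iteration cannot decrease the held-out likelihood, so the perplexity with the estimated weights is never worse than with the uniform starting point.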

## **4.4 Pruning**

Typically, an uncompressed LM of a highly inflective language is comparable in size to the text data on which it has been trained. To build LMs for real-time applications, it is necessary to limit the size of the resulting LM. In highly inflective languages, a large vocabulary increases the number of *n*-grams in the LM that occur in the training set just once or twice and have little impact on the quality of the LM or the accuracy of the recognition system. Such *n*-grams can therefore be excluded from the LM by pruning.

There are several criteria for pruning LMs. To create an efficient and compact model of the Slovak language for a real-time LVCSR application, we examined the influence of the following pruning methods on the quality of the LM: (a) *cutoff counts*, (b) the *weighted difference method* (Seymore & Rosenfeld, 1996), and (c) *pruning based on relative entropy* (Stolcke, 1998). We found that *relative entropy-based pruning* achieved the best results.
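Of the three methods, count cutoffs are the simplest to illustrate. The sketch below drops every *n*-gram whose count does not exceed a per-order threshold; it does not implement the weighted difference or relative entropy criteria, and the function name, the thresholds and the toy counts are assumptions.

```python
def apply_cutoffs(ngram_counts, cutoffs):
    """Keep only n-grams whose count exceeds the cutoff defined for their order.

    ngram_counts maps word tuples to counts; cutoffs maps an n-gram order
    to a threshold (orders without an entry are never pruned).
    """
    return {ngram: count for ngram, count in ngram_counts.items()
            if count > cutoffs.get(len(ngram), 0)}

counts = {                                  # toy counts (hypothetical)
    ("súd",): 5,
    ("súd", "rozhodol"): 2,
    ("súd", "rozhodol", "dnes"): 1,
    ("súd", "rozhodol", "včera"): 3,
}
pruned = apply_cutoffs(counts, {2: 1, 3: 1})  # drop singleton bigrams and trigrams
```

In a real back-off LM the probabilities and back-off weights have to be recomputed after such n-grams are removed, which this sketch leaves out.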

# **5. Model optimization**

Several techniques and principles have been used and proposed in order to obtain an efficient model of the Slovak language for off-line and on-line speech recognition. These optimization techniques combine statistical, linguistic and phonetic principles and practices. They increase the quality of language models, decrease the errors of the LVCSR system and improve the usability of these models in real conditions of speech recognition in Slovak. They are described in the following sections.

## **5.1 Spelling pronunciation**

One of the main problems with a significant influence on the overall result of speech recognition is how to obtain the best phonetic transcription of the words contained in the dictionary. The *transcription of words between their orthographic and orthoepic forms* also concerns *abbreviations* and *acronyms* that are usually spelled character by character, for example IBM, PhD., P. O. Box, etc. It was necessary to unify these items, to define their transcription into the Slovak phonetic alphabet (Cerňak et al., 2003) and to assign them all possible pronunciation variants. We detected about 620 abbreviations and acronyms (510 alternative pronunciations) in the text corpora mentioned in Section 2 and manually modified their transcription according to the linguistic rules of the Slovak language.

## **5.2 Modeling of noise events**

Spontaneous speech is also characterized by various *non-speech sounds* or *expressions*, which are mainly generated by the speaker or the surrounding environment. When analyzing the hypotheses produced by our dictation LVCSR system, we encountered relatively many mistakes at the beginning of the speech or after a long pause, in situations where the speaker paused, coughed, smacked their lips, etc. We decided to explore ways to model these so-called *noise events* in Slovak language modeling without knowing their occurrences in the training data set and without falsely increasing the estimates of their probabilities. Since the locations of noise events are usually marked with special tags by annotators during the transcription or annotation of speech recordings into text, we decided to include these *annotations of speech recordings* with noise tags in the process of training LMs and to model the Slovak language using selected noise events as well.

First, we mapped all noise tags contained in the annotations into five groups: (a) *short pause*, (b) *long pause*, (c) *filled pause*, (d) *background* and (e) *speaker noise* (Staš et al., 2010b). These groups were then included in the dictionary and used in language modeling. After recognition, these noise events appear in the output as *transparent words*.

## **5.3 Multiwords in Slovak language modeling**

As mentioned in the previous section, the most common mistakes in speech recognition arise at the beginning of the speech or after a long pause. They can also be caused by *misrecognition of short monosyllabic words* consisting of no more than three or four characters. These words are often attached to the following or preceding word, recognized as noise or ignored (Kolorenč et al., 2006). To avoid this problem, it is suitable to model these events using *multiword expressions* (MWEs).

It has been shown that MWEs formed by connecting a short (monosyllabic) word with a long (di-, tri- or polysyllabic) word, which is usually easier to recognize, can help increase the recognition accuracy of the given short word. Moreover, using MWEs effectively increases the order of the *n*-gram LM and decreases the number of pronunciation variants that depend on the context of the given word, because in an inflective language some words are pronounced differently in different contexts.

The extraction of MWEs in the Slovak language followed these selection criteria (Staš et al., 2011b):


To select multiwords, we used standard statistical measures based on the *absolute* and *relative co-occurrence* and the *pointwise mutual information* (PMI) of word pairs in the text corpora, limited by linguistic constraints. The choice of these measures was intentional. Absolute frequency captures the most frequent events in the given language. Relative frequency in the context of the first word extracts MWEs that begin with parts of speech such as prepositions, conjunctions or pronouns, which in Slovak usually occur in the first position of an MWE. PMI reflects collocations that do not occur frequently in the language but usually have a certain meaning.
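The three measures can be sketched for adjacent word pairs as follows. The sketch assumes whitespace-tokenized sentences, base-2 logarithms for PMI and maximum likelihood estimates of all probabilities; the function names and the toy corpus are hypothetical, and the linguistic constraints are not modeled.

```python
import math
from collections import Counter

def pair_statistics(sentences):
    """Count single words and adjacent word pairs in a tokenized corpus."""
    unigrams, pairs = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        pairs.update(zip(words, words[1:]))
    return unigrams, pairs

def relative_frequency(w1, w2, unigrams, pairs):
    """Co-occurrence count of the pair relative to the count of its first word."""
    return pairs[(w1, w2)] / unigrams[w1]

def pmi(w1, w2, unigrams, pairs):
    """Pointwise mutual information of an adjacent word pair (log base 2)."""
    if pairs[(w1, w2)] == 0:
        return float("-inf")  # pair never observed
    p_pair = pairs[(w1, w2)] / sum(pairs.values())
    p_w1 = unigrams[w1] / sum(unigrams.values())
    p_w2 = unigrams[w2] / sum(unigrams.values())
    return math.log2(p_pair / (p_w1 * p_w2))

corpus = ["v súde", "v súde", "v meste", "na súde"]   # toy corpus (hypothetical)
unigrams, pairs = pair_statistics(corpus)
absolute = pairs[("v", "súde")]                        # absolute co-occurrence
relative = relative_frequency("v", "súde", unigrams, pairs)
score = pmi("v", "súde", unigrams, pairs)
```

Ranking candidate pairs by each measure separately and then applying the linguistic constraints is one possible way to combine them; the exact combination used in the chapter is not specified here.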

The linguistic constraints come from observing the behaviour of the LVCSR system in the process of testing LMs. We discovered that our LVCSR system often makes mistakes in the following cases: (a) when there is an *assimilation of voicing* on word boundaries and (b) when the first word in an MWE ends with the *same letter* as the second word begins.

Using the proposed methodology for extracting MWEs from the text corpora, we obtained about 3 000 word pairs (561 pronunciation variants), which were included in the dictionary with their phonetic transcription and in the process of training Slovak language models.

## **5.4 Class-based models**

Another problem in using an LVCSR system is the possibility of inserting new words into the dictionary and the LM. A similar problem arises in recognizing proper nouns such as names, surnames, geographical names and other named entities. The recognition of names and surnames is one of the key properties of an on-line dictation LVCSR system and has a noticeable influence on its usability in real conditions. There are various suboptimal solutions with which we can cover a large part of the vocabulary of a given language and also deal with the problem of inserting new words without retraining the LM. One of these solutions is *class-based* LMs, which are of great importance for certain problematic tasks, because they generalize the context dependency even to words that have not occurred in the training corpora. We decided to use class models for names and surnames in Slovak, so that the word classes can easily be extended and the problem of inserting new words into the dictionary resolved.

For this purpose, we developed a *rule-based morphological tagger for names and surnames*, based on matching against a predefined set of names and surnames (patterns) conditioned by their grammatical category. The accuracy of this approach is limited only by the number of patterns and the selected rules. The approach relies on the semantic similarity of formal expressions and on the syntactic knowledge contained in the grammatical category of a proper noun. In this way we replaced approximately 24 818 unique inflected forms of names and surnames with one of 20 morphological tags, depending on the case of the given proper noun.
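One common way to write such a class-based trigram model is sketched below. This is the standard textbook factorization, given here as an assumption, since the chapter does not state the exact form used. Here $c_i$ denotes the class of word $w_i$: one of the 20 morphological tags for a name or surname, or the word itself for ordinary words.

$$P(w_i \mid w_{i-2} w_{i-1}) \approx P(c_i \mid c_{i-2} c_{i-1}) \, P(w_i \mid c_i)$$

Under this factorization, adding a new surname only requires assigning it to a class and setting $P(w_i \mid c_i)$; the class trigram statistics stay unchanged.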

To increase the recognition accuracy, we have also created an *independent model for names and surnames*, which can be used in a special dictation mode of our Slovak dictation LVCSR system as a parallel model to the primary domain-specific LM from the field of judicature.

## **5.5 Morphology**

Inflection in the Slovak language usually occurs on the border between the stem and the ending. This knowledge can help in modeling unknown words or words with a low occurrence in the training corpus using *morpheme-based models* (Byrne et al., 2000; Creutz et al., 2007). By dividing singletons or low-frequency words in the training corpus into morphemes, it is statistically possible to cover events that do not occur in the dictionary and the LM. Knowledge of the morphology of the given language also makes it possible to generate new word forms, as in the case of the declension of names and surnames described in Section 3 and Section 5.4.

## **5.6 Augmentation statistics of** *n***-grams**

Current research in language modeling is oriented towards augmenting the statistics of bigrams or trigrams from resources other than a large amount of gathered text data in the given language. Statistics of seen or unseen *n*-grams can be obtained by using:


3. machine translation systems in *translation n-grams* from other (similar) languages.

Finally, contemporary modeling of the Slovak language uses only text data (trigrams) obtained from the Slovak National Corpus (SNC) (Šimková, 2006) to augment the statistics of *n*-grams used in training LMs. However, research and development in the other mentioned areas is also in progress.

# **6. Speech recognition setup**

The following sections present the setup of our LVCSR system, the proposed methodology of training Slovak LMs, the annotated speech databases, acoustic modeling and the data used for testing LMs. The setup of the LVCSR system was adjusted for testing LMs oriented to the judicial domain and to broadcast news transcription in the Slovak language.

## **6.1 Language modeling**

Experiments were performed with trigram LMs created using the tools of the *SRI Language Modeling* (SRILM) *Toolkit* (Stolcke, 2002) with the vocabulary mentioned in Section 3. The complete process of building the reference LM of the Slovak language can be summarized in the following steps:


## **6.2 Acoustic modeling**

Context-dependent triphone acoustic models based on *hidden Markov models* (HMMs) were used, where each state was modeled by 32 Gaussian mixtures. The models were generated from feature vectors containing 39 *mel-frequency cepstral* (MFC) *coefficients*. They were trained on two databases of annotated speech recordings.

The first, broadcast news speech database contains about 60 hours of recordings, mostly of professionally trained speakers, taken from Slovak TV broadcast news from 2007 to 2009. The database is gender balanced and contains read speech, spontaneous speech and a small amount of telephone speech, with a 48 kHz sampling frequency and 16-bit resolution.

The second, judiciary speech database contains about 120 hours of real court judgments with changed personal data, read and recorded in studio conditions, and about 130 hours of read phonetically rich sentences, newspaper articles, internet texts and spelled items, recorded in offices and conference rooms. This database, 250 hours in total, was recorded from 250 gender-balanced speakers with a 48 kHz sampling frequency and 16-bit resolution. It was then extended with about 100 hours of spontaneous speech (90% male), recorded from 120 speakers in a council hall with a 44 kHz sampling frequency and 16-bit resolution.

All recordings were later downsampled to 16 kHz for training and testing. The databases were annotated by a team of trained annotators using the *Transcriber annotation tool* (Barras et al., 2001), slightly adapted to our needs, and were checked and corrected twice.

For acoustic modeling of rare triphones, the *effective triphone mapping* algorithm was used (Darjaa et al., 2011). According to its authors, this knowledge-based triphone tying, which allows the synthesis of unseen triphones, outperforms standard tree-based state tying for acoustic models with 4 000 states and more, whereas for acoustic models with a smaller number of states the performance is equal.

## **6.3 Phonetic transcription**

Phonetic transcription of selected words contained in the vocabulary was performed using a *data-driven approach to orthoepic transcription in the Slovak language* (Cerňak et al., 2003) with slight modifications. It was trained using the phonetically rich sentences from the SpeechDat-E and MobilDat-SK Slovak speech databases (Rusko et al., 2006) with a new sentence-based pronunciation lexicon, and additional sentences with manually annotated pronunciation from a regional broadcast news speech corpus.

## **6.4 LVCSR decoder**

For decoding, the *high-performance LVCSR engine Julius* (Lee et al., 2001), whose recognition algorithm is based on a two-pass strategy, was used. The input data are processed in the first pass with a left-to-right bigram LM; the second pass performs the final search with a reverse (right-to-left) trigram LM, using the result of the first pass to narrow the search space.

## **6.5 Test data set**

The first test data set consisted of 240 minutes of recordings obtained as randomly selected segments from the broadcast news speech database. These segments were not used in training the acoustic model and contain 40 656 words in 4 343 sentences.

The second test data set, from the field of judicature, consisted of 315 minutes of recordings, also obtained as randomly selected segments from each speaker in the read part (250 hours) of the second speech database. As in the first case, these segments were not used in training; they contain 41 878 words in 3 426 sentences and phrases. We decided to include phrases in the second test set because, in real conditions, people pause not only on sentence boundaries but also on phrase boundaries, usually before conjunctions.

## **6.6 Evaluation**

Two standard measures were used for evaluating the LM: (a) extrinsic evaluation using the *word error rate* (WER) and (b) intrinsic evaluation based on *perplexity* (PPL) calculated on a test data set. WER is a standard measure of the performance of an LVCSR system; it is computed by comparing the reference text read by a speaker against the recognized result and takes into account insertion, deletion and substitution errors. If an LVCSR system is not available, perplexity is often used for evaluation. It is defined as the reciprocal of the (geometric) average probability assigned by the LM to each word in the test set. This measure does not necessarily evaluate the accuracy of recognition itself, but usually correlates highly with it.
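Both measures can be sketched in a few lines. The WER function below uses the usual word-level edit distance with unit costs for substitutions, deletions and insertions, and the perplexity function follows the definition as the reciprocal geometric mean of per-word probabilities. The function names and the toy inputs are hypothetical.

```python
import math

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)

def perplexity(word_probs):
    """Reciprocal of the geometric mean of the probabilities of the test words."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# One deleted word ("o") and one inserted word ("dnes") against four reference words.
wer = word_error_rate("súd rozhodol o odvolaní", "súd rozhodol odvolaní dnes")
# A model that assigns probability 1/4 to every test word.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Because insertions are counted against the reference length, WER can exceed 100% for very poor hypotheses.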

# **7. Experimental results**

The experiments evaluated WER and PPL on the test data sets in order to discover the effect of the proposed optimization techniques and principles in Slovak language modeling on the overall recognition accuracy of the LVCSR system. As mentioned in Section 6.1, the experiments were performed with trigram LMs created with a vocabulary of 348 255 unique words or more, listed in Table 3, and smoothed in every case with the *modified Kneser-Ney algorithm*. For adaptation and combination of the LMs trained independently on the text corpora listed in Table 1, standard *linear interpolation* was used, with interpolation weights adjusted to the selected domain using the EM algorithm. The experiments were oriented to off-line testing of LMs, where the emphasis is on the best recognition accuracy rather than on the memory requirements of the application, as it is in on-line speech recognition, where one of the LM pruning techniques has to be used. With pruned models, it would be difficult to find an appropriate pruning threshold that maintains an equal number of *n*-grams in the LMs and so to compare the contribution of a given LM to speech recognition.

To observe the impact of the selected optimization techniques and principles on speech recognition, training and testing of LMs were performed in two independent areas: (a) the broadcast news transcription task and (b) the judicial domain. This also includes the use of the appropriate acoustic model and speech recordings for testing, described in Section 6.2 and Section 6.5, respectively. Experimental results for both tasks in Slovak LVCSR are described in the following sections.
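The percentages reported in the following subsections are relative changes. Assuming the usual definition (the chapter does not state the formula explicitly), the relative reduction of a measure $M$ (WER or PPL) from a baseline value to a new value is

$$\Delta_{\mathrm{rel}} = \frac{M_{\mathrm{base}} - M_{\mathrm{new}}}{M_{\mathrm{base}}} \cdot 100\,\%$$

For example, with purely hypothetical values, a WER that falls from 10.0% to 9.0% is reduced by 1.0 percentage point in absolute terms but by 10% in relative terms.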

## **7.1 Broadcast news transcription**

The broadcast news transcription task is directed to the general area of speech recognition, usually the recognition and transcription of continuous spontaneous speech. In modeling the Slovak language and adapting to this domain, we achieved the following results. As can be seen in Table 3, adaptation to the general area of speech recognition, represented by randomly selected sentences from broadcast news text corpora not used in the training process, yielded a relative decrease of almost 1.39% in WER and 17.36% in PPL. In the next step, modifying the rules of phonetic transcription for spelled abbreviations, we observed a moderate improvement from a subjective rather than an objective point of view. This is also caused by the undesirable shortening of the history for some abbreviations such as P. O. Box, M. D., etc., which reduces the predictive ability of the LM. By extending the training data set with text data obtained from annotations of speech recordings, we achieved an additional relative decrease of 0.75% in WER and about 2% in PPL. Since the testing data from the general domain contained only a small number of the selected MWEs and names or surnames, the contribution of the introduced multiwords and word classes to speech recognition was too small. Variations were observed only in perplexity, which increased due to the shortening of the history for MWEs and, on the contrary, decreased thanks to the more fixed connections between word classes. The most significant improvement was achieved by augmenting the trigrams from the SNC database. The relative decrease of about 3% in WER and 5% in PPL results from the fact that the SNC database contains mostly text data from newspapers and fiction. Overall, the selected optimization techniques brought a relative reduction of approximately 5.57% in WER and 28.24% in PPL in the broadcast news transcription task of Slovak LVCSR.

Table 3. Experimental results for off-line testing of the Slovak LVCSR system

#### **7.2 Speech recognition in judicial domain**

This domain was selected because it is one of the most challenging acoustic and linguistic environments from the research point of view and, from the development point of view, because of market demand. Adaptation to the judicial domain brought a significant relative improvement of 12.12% in WER and 20.78% in PPL, even though only a small amount of adaptation data was added. As in the broadcast news transcription task, modifying the pronunciation of spelled items produced no notable change in WER or PPL. The text data from annotations of speech recordings significantly decreased both values, by more than 10% in WER and 40% in PPL, relatively. This is mainly because more text data (more hours of annotated speech recordings) was available for the judicial domain than for the broadcast news transcription task. Multiwords brought an improvement in only about 5% of cases, at the beginning of speech or after a long pause, which did not significantly change the overall recognition result. Because the test data contained a large number of names and surnames, word classes gave an additional relative reduction of 3% in WER and more than 10% in PPL. Augmenting the trigram statistics did not improve the resulting LM, because the SNC database does not contain any text from the field of judicature. Overall, the optimization steps described above yielded a relative reduction of approximately 24% in WER and 56% in PPL in this domain-specific task of the Slovak LVCSR.
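The perplexity (PPL) values discussed above can be illustrated with a short sketch. It assumes the standard definition of perplexity as the exponential of the negative average log-probability that the LM assigns to each word of the test text. The function name and the probabilities are hypothetical and do not come from the experiments.

```python
import math


def perplexity(word_probabilities):
    """Perplexity from per-word LM probabilities P(w_i | history)."""
    if not word_probabilities:
        raise ValueError("at least one word probability is required")
    if any(p <= 0.0 for p in word_probabilities):
        raise ValueError("probabilities must be positive (smoothing is assumed)")
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))


if __name__ == "__main__":
    # Hypothetical probabilities assigned by an LM to a four-word test text.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A model that assigns every word the probability 1/V has a perplexity of V, so a lower PPL can be read as a smaller effective number of equally likely word choices at each position.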

#### **7.3 Discussion**

Using the selected methods, principles and approaches in statistical modeling of the Slovak language, together with the proposed optimization techniques, our LVCSR system achieved a recognition accuracy of almost 94% in the domain-specific task from the field of judicature and approximately 90% in broadcast news transcription. The vocabulary used in the experiments covers about 99% of commonly used words in the Slovak language.
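The relation between WER and recognition accuracy can be sketched as follows. This is an illustration under stated assumptions, not the scoring tool used in the experiments: WER is taken as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words, and accuracy is taken as 100% minus WER. The function names and the example sentences are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, deletions and insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent."""
    n_reference = len(reference.split())
    if n_reference == 0:
        raise ValueError("reference must contain at least one word")
    return 100.0 * word_errors(reference, hypothesis) / n_reference


def accuracy(reference: str, hypothesis: str) -> float:
    """Recognition accuracy in percent, assumed here to be 100 - WER."""
    return 100.0 - wer(reference, hypothesis)


if __name__ == "__main__":
    reference = "the court ruled in favour"
    hypothesis = "the court ruled favor"   # one deletion, one substitution
    print(wer(reference, hypothesis))       # 40.0
    print(accuracy(reference, hypothesis))  # 60.0
```

Because insertions are counted as errors, WER computed this way can exceed 100% and the corresponding accuracy can become negative for very poor hypotheses.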

The experimental results suggest that recognition accuracy could be increased further by extending the word classes with names of cities, streets, institutions and other named entities in their inflected forms. With regard to memory requirements, a purely class-based approach to Slovak language modeling could be more suitable. However, the absence of an available morphological tagger for the Slovak language limits the use of this approach, although the first steps in this area have already been taken.
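To illustrate why word classes help with names and surnames, the following sketch shows a generic class-based bigram model, in which the probability of a word given its predecessor is factored as P(c | c_prev) · P(w | c). It is a minimal maximum-likelihood example without smoothing and is not the configuration used in our system. The toy sentences, the class label `SURNAME` and all function names are hypothetical.

```python
from collections import Counter


def train(sentences, word2class):
    """Collect maximum-likelihood counts for a class-based bigram model."""
    class_bigrams = Counter()   # count(c_prev, c)
    class_history = Counter()   # count(c_prev) used as a bigram history
    word_in_class = Counter()   # count(w, c)
    class_total = Counter()     # count(c)
    for sentence in sentences:
        # Words without an assigned class act as their own class.
        classes = [word2class.get(w, w) for w in sentence]
        for w, c in zip(sentence, classes):
            word_in_class[(w, c)] += 1
            class_total[c] += 1
        for c_prev, c in zip(classes, classes[1:]):
            class_bigrams[(c_prev, c)] += 1
            class_history[c_prev] += 1
    return class_bigrams, class_history, word_in_class, class_total


def probability(word, prev_word, model, word2class):
    """P(word | prev_word) = P(c | c_prev) * P(word | c), no smoothing."""
    class_bigrams, class_history, word_in_class, class_total = model
    c = word2class.get(word, word)
    c_prev = word2class.get(prev_word, prev_word)
    if class_history[c_prev] == 0 or class_total[c] == 0:
        return 0.0
    p_class = class_bigrams[(c_prev, c)] / class_history[c_prev]
    p_word = word_in_class[(word, c)] / class_total[c]
    return p_class * p_word


if __name__ == "__main__":
    # Hypothetical toy corpus with two surnames sharing one class.
    word2class = {"Novak": "SURNAME", "Kovac": "SURNAME"}
    sentences = [["mister", "Novak", "arrived"], ["judge", "Kovac", "left"]]
    model = train(sentences, word2class)
    # The word pair "mister Kovac" never occurs in the toy corpus,
    # yet it receives a non-zero probability through the shared class.
    print(probability("Kovac", "mister", model, word2class))
```

In this toy example a word-based bigram model would assign zero probability to the unseen pair, whereas the class-based model generalizes across all members of the class. Extending a class with further named entities therefore only requires adding entries to the word-to-class mapping.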

Contemporary research in Slovak language modeling also addresses other areas, such as vocabulary selection for a specific domain, topic detection in web corpora, augmentation of LM statistics using machine translation systems or web search engines, on-line adaptation of LMs, modeling of unknown words in spontaneous speech, morphologically motivated class-based modeling, investigating the influence of morpheme-based models, and eliminating speech recognition errors caused by the vocabulary or the language model.

As regards the real-world application of domain-oriented speech recognition, a new version of our LVCSR system for the Ministry of Justice of the Slovak Republic is currently being finalized; it uses the knowledge about Slovak language modeling and the LMs described in this chapter. At the time of preparing this chapter, the proposed LVCSR system had been installed for testing and was being used by more than 50 persons (judges, court assistants and technicians) at 9 different institutions belonging to the Ministry of Justice. The test results will be taken into account in the final version of the Slovak LVCSR system, which is to come into everyday use at the organizations belonging to the Ministry of Justice of the Slovak Republic by the end of 2011.

# **8. Conclusion**

This chapter has presented a brief summary of the current methods and principles used in Slovak language modeling. By combining standard statistical methods with the proposed language-dependent optimization techniques, which bring additional information, often including linguistic regularities, into the LM training process, we achieved a notable improvement in the recognition accuracy of our Slovak LVCSR system, both in broadcast news transcription and in domain-specific speech recognition from the field of judicature. We found that using several different approaches, each oriented to a specific problem in language modeling, helps to better eliminate the errors that arise in speech recognition of a highly inflective language such as Slovak. The major contribution in the area of Slovak language modeling is that the current language models are also used in the development and deployment of the Slovak automatic transcription and dictation LVCSR system for the judicial domain.

# **9. Acknowledgement**

The research presented in this paper was supported by the Ministry of Education under research projects VEGA-1/0065/10 and MŠ SR 3928/2010-11 and by EU ICT Project INDECT (FP7–218086).


**New Technologies - Trends, Innovations and Research** Edited by Prof. Constantin Volosencu

ISBN 978-953-51-0480-3, hard cover, 396 pages. **Publisher** InTech. **Published online** 30 March 2012. **Published in print edition** March 2012.

#### **How to reference**

In order to correctly reference this scholarly work, feel free to copy and paste the following:

Jozef Juhár, Ján Staš and Daniel Hládek (2012). Recent Progress in Development of Language Model for Slovak Large Vocabulary Continuous Speech Recognition, New Technologies - Trends, Innovations and Research, Prof. Constantin Volosencu (Ed.), ISBN: 978-953-51-0480-3, InTech, Available from: http://www.intechopen.com/books/new-technologies-trends-innovations-and-research/recent-progress-indevelopment-of-language-model-for-slovak-lvcsr

© 2012 The Author(s). Licensee IntechOpen. This is an open access article distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.